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Abstract 

We consider novel phylogenetic models with rate matrices that arise via the embedding of a 
progenitor model on a small number of character states, into a target model on a larger number of 
character states. Adapting representation-theoretic results from recent investigations of Markov 
invariants for the general rate matrix model, we give a prescription for identifying and counting 
Markov invariants for such 'symmetric embedded' models, and we provide enumerations of these 
for low-dimensional cases. The simplest example is a target model on 3 states, constructed 
from a general 2 state model; the '2^-3' embedding. We show that for 2 taxa, there exist two 
invariants of quadratic degree, that can be used to directly infer pairwise distances from observed 
sequences under this model. A simple simulation study verifies their theoretical expected values, 
and suggests that, given the appropriateness of the model class, they have greater statistical 
power than the standard (log) Det invariant (which is of cubic degree for this case). 
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1 Introduction 



Phylogenetic inference based on molecular sequence data typically involves the selection of one 
or more specific models for state substitutions. There is a well-known hi erarchy of classes of 
4x4 rate matrices, with varying complexity and numbers of free parameters ([Posada fc Crandall , 
1998). However, for a given data set it is not always (if ever) clear which s ubstitution model is 
most appropriate or "best". For example, the software package ModelTest (jPosada &; Crandall . 
19981 ) selects a model of nucleotide substitution that best fits a given data set under a likelihood, 
information theoretic or Bayesian framework. While it is useful to be able compare results derived 
from different models, given the dangers of over-parametrisation the question of which datasets best 
conform to which class of models is difficult to resolve. Elaborations such as allowing rate variation 
across sites and invariant sites are also standard ingredients which allow for more flexibility in 
data-fitting. More extreme measures, which relinquish the convention al picture of spec i es evo lution 
via Markov models on trees, lead to generalisations such as mixtures ( Pagel &: Meade! . 12004! ) (sites 
which have probabilities for following different models), "mosaics" ( Woodhams et all l2009l ) (edge 



classes or subtrees which have different weights for different models), or ultimately network models 
(jBandelt fc Dressl . Il992l : iHolland fc Moultonl . I2004T ) , :'or example using acyclic directed graphs. 

In recent work we have introduced so-called "Markov invariants" ( Sumner et al. . 20081 : Sumner Jar vis 
2009T ). which are polynomial quantities built up out of the phylogenetic patt ern frequencies or diver- 
gence array s . Mar kov invariants are distinct from "phylogenetic invariants" (jCavender &: Felsenstein 



19871 : lLakel . Il987h in that they are defined to behave as a (one-dimensional) "representation" of 
continuous unfolding of the Markov process. This means that as the Markov process proceeds in 
time by an amount r, the expectation value of the Markov invariants simply scales with a product 
of the multiplicative constants det(m e ) = e tr ^^ T , where m e = e® eT is the transition matrix associ- 
ated with the edge e. With this understanding, the "invariance" property of the Markov invariants 
is captured by the simple time-dependence 



r — > t + r 



e tr(Qe)T e tr(Q e y 



It is important to note that the definition of phylogenetic invariants stipulates no such constraint. 

Markov invariants are phylogenetically informative for the most general phylogenetic model, 
giving some information of both model parameters and tree topology, and can be implemented 
without the need for explicit parameter estimation (ie. via optimization of a likelihood function). 
These invariants generalise the "log Det" distance measure which has precisely this feature: pairwise 
distances can be directly estimated whose expected value turns out to be the sums of rate parameters 
multiplied by time. We have identified Markov invariants for diverse combinations of numbers 
of taxa and number s of characters. For exa mple there are three so-called "squangle" invariants 
( Sumner et all 120081 : 1 Sumner Jarvid . 120091 ) for quartets of taxa and four character states, whose 
values directly resolve and distinguish the three unrooted trees 12|34, 13 1 24, 14| 23 for the most 
general Markov model, without the need for parameter estimation. Indeed, there are consistency 
arguments to suggest that the Markov invariants, as defined simply as functions of the phylogenetic 
pattern data, are in fact identical to their maximum likelihood estimators (or, more technically, 
belong to the ideal genera ted by the solutions of t he likelihood equations). This is trivially true 
for the log Det estimator (lAllman &: Rhodesl . liooih and its proof in the general case is a subject 
of future work. 

Notwithstanding these promising developments for the general Markov model, it is still of 
great interest to have at hand tools for exploring the full range of models available for phylogenetic 
inference. In this vein, iHuelsenbeck et al\ ( 20041 ) identified up to 203 submodels of the general time 
reversible model (GTR) for four characters, the number being based simply on the combinatorial 
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problem of counting compositions and refinements as parameters are turned on and off at various 
positions in the rate matrix. 

In view of the discussion above concerning model classes, a natural criterion for model and 
sub- model selection is that of the multiplicative closure of the edge transition matrices, or semi- 
group property. Multiplicative clo sure i s sometimes held out to be required for establishing the 
unrootedness of phylogenetic trees (jlsaeyj, 12004 ; ISemple Steell . l2003h , but for a given tree it seems 
not to be necessary. Its importance arises more directly from the methodology and interpretation 
of phylogenetic reconstruction. In doing tree searches for example, a swing of a leaf edge from 
one subtree to another entails a cut-and-rejoin operation: the incoming and outgoing Markov edge 
matrices from the source node of the originating leaf edge must be multiplied, while the Markov 
matrix from the target edge where the leaf is rejoined, must be expressed in turn as the product 
of Markov matrices for two new edges. Again, the possibility of extinctions along some edges, or 
of incomplete taxon sampling, suggests that, to allow for correct marginalisation, multiplicativity 
is mandator y for a consistent interpr etation. It is clear that scarcely any of the GTR matrices 
identified bv lHuelsenbeck et al\ (|2004l ) will be multiplicative (indeed, only some well-known models 
comply, for example the symmetric models such as the Kimura model, as well as Felsenstein's TR, 
non-symmetrical model). Indeed it is easy to show that the GTR model itself is not multiplicative, 
and this poses serious interpretive questions if the GTR class is used in generalized models where 
more than a single rate matrix is implemented in the analysis. 

In this paper we introduce, and explore through Markov invariants, new classes of submodels 
for a given number of characters, generated from general models in smaller numbers of charac- 
ters. These we term 'symmetric embeddings'. Clearly, in essence, these submodels contain similar 
information to the originating model in lower dimensions (with a smaller number of characters, 
and fewer parameters), and it is the manner in which this intuition is realised in technical detail, 
which we wish to elaborate here. In contrast to the naive identification of submodels by the mere 
presence or absence of additional parameters however, our embedded models are by construction 
multiplicatively closed. Moreover we are able to adapt the technical setting of Markov invariants to 
this new situation, and so derive new invariants of different structure and polynomial degree from 
the standard ones, which play an equivalent role to them. Again, these new invariants fulfill the 
expectation (because their underlying models have fewer parameters) that they should be of lower 
degree than the standard ones. 

In Sj2] below we introduce the symmetrically embedded models. These are given in a general 
setting, but we concentrate in detail on the general 2 state model embedded into 3 character 
models, called the '2 ^ 3' case. A variety of Markov invariants, for diverse degrees and numbers 
of characters and taxa, is enu merated in |S1 after adapting the group rep resentation method for 
identifying Markov invariants ( Sumner et all . 2008 ; Sumner & Jarvid . 120091 ) to the present setting. 
The simplest case is again that of '2 ^ 3', with two taxa, where it is shown that, apart from the 
(degree 3) determinant function (guaranteed to exist for the general Markov model and any number 
of characters, and well known as the log Det measure), there are two additional quadratic degree 
invariants called J31 and /2,2- These are constructed explicitly and their properties are explored. 

Finally in $3] the paper is summarised, and the conclusions supported by some simple sim- 
ulated data analysed for comparison using "Det" invariant, or ^3 in our notation, as well as I3 x 
and I11 invariants. As expected, the invariants of lower degree (and "weight", see below), are 
apparently statistically better behaved, at least from this preliminary numerical test. The paper 
ends with some concluding remarks and prospects for further work. An appendix , ^Aj gives an 
adaptation of a technical representation-theoretic result from ( Sumner et all 20081 ) enabling the 
Markov invariants for embedded submodels to be enumerated and constructed as presented in §3 
for low-dimensional cases. 
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2 Symmetrically embedded character substitution models 



In this section we introduce the concept of symmetrically embedded models, concentrating initially 
on the two state case, and developing the analysis to be able to present the 2 3, 2 4 and 
2 K rate matrices in detail. The discussion finishes with an overview of the general case. 

Consider the rate matrix for the general Markov model on two characters, 

«=(T 4) =a ("? o) + s (o -i) s ° L «+<^ w 

where as usual in the Markov matrix m(t) := exp(Qt), the matrix elements m a f, have the interpre- 
tation 

m ab = P [X(t) = a | X(0) = b] 

for the random variable X(t) in character spac^ll describing the probability of change (along each 
edge after time t), 



Pa(t) = y^m ab p b (0), 



6=1 



or p(t) = m(t) ■ p(0), where we have K = 2 characters and the edge probability distribution is 
Pa (t) = F(X(t) = a). 

We have chosen to write Q in terms of the natural basis of column-sum zero 'stochastic gen- 
erator matrices' {L a ,Lp} of the group GL\(2); the subgroup of the general linear group GL(2) of 



inver t ible 2 x 2 matr ices together with the probabilitistic constraint of unit-column sums ([Johnson 



19851 ; iMouradl . I2004J ) . This is relevant for considerations of multiplicative closure of models, which 
might arise in applications where different rate matrices are allowed on different parts of a phyloge- 
netic tree; where potentially missing taxa may need to be inserted into edges; or where re-evaluations 
of phylogeny may require edge rearrangements. In this case of a general Markov rate model, in 
continuous time, closure of the product M\M 2 = exp Q\ exp Q 2 is guaranteed by the so-called BCH 
formula which requires closure of the commutator brackets [Qi, Q2] '■= Q1Q2 — Q2Q1 of the Q's: 

expQi expQ 2 = exp(Qi + Q 2 + ~[Qi,Q 2 ) + j^[Qi, [Qi, Q2]} - j^[Q2, [Qi, Q2]) + •••)• ( 3 ) 

Referring to the BCH formula ([3]), it is immediately clear that closure is assured if the rate 
matrices form a Lie algebra, which in this case follows as we have chosen the most general two-state 
model. Specifically, 

[L a ,Lp\ = L a — Lp = —[Lp, L a ] 
with of course [L a ,L a ] = = [Lp,Lp]. 

How can we use ([1]) to infer rate matrices for 'target' models on different numbers of char- 
acters (larger than 2)? A natural observation from the linear m-action on the array (vector) p a is 
that a similar linear action can be obtained not only on the components of p, but also on any ho- 
mogeneous polynomials in the components. Specifically we can regard the k + 1 distinct monomials 
in components of the p at fixed degree k, as the formal components of a new, k + 1-dimensional 
array. 

Consider for instance the case k = 2, k + 1 = 3, and the monomials p|, P1P2 = P2P1, 
and p\. We write four terms to emphasize that the new character probabilities, say Pi = p\, 



In our notation the the random change process is implemented by the left matrix action, so that the column sum 
of m is unity (the column sum of Q vanishes) . 
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P2 = 2piP2i P3 = v\ are really components of a symmetric array (tensor) p a b := p a pb, with of 
course p a b = pb a , and to motivate the choice of scaling; we use the bionomial expansion (pi +P2) 2 = 
p\ + 2p\p2 +P2 = 1- If we take the differential form of the change rule ([2]), dp/dt = Qp, and write 
the induced transformation on P (considered as a three component vector) as dP/dt = Q^P, it is 
then easy to infer by considering dp a b/dt and referring to (pQ). Following this through we find 
that 



Q 



(3) 




a 



-2 
2 -1 
1 



+ /3 



1 



= a4 3 )+/?4 3) . 



(4) 



By construction, the new generator matrices and l! form a subalgebra of the Lie algebra of 

(3) (3) 

GLi(3) and satisfy the same commutation relations as their 2x2 progenitors, namely [L a , L\ ] = 

(3) (3) 

L a Lg ■ Thus technically we have an embedding of the Lie algebra of GL\{2) into that of GLi(3), 
which we shall denote 2^-7-3 (and of course multiplicative closure for this class of 3 x 3 model is 
guaranteed) . 

The generalisation to the 2 ^ 4, or 2 K, character case is immediate. For K = 4 we have 



Q 



(4) 



/ -3a (3 

3a -2a-/3 2/3 
2a -a- 2/3 3/3 
\ a -3/3 / 



a 



-3 

3 -2 

2 







/ 



+ /3 



/ 




V 





3 

-3 



a4 4 )+/3L (4) 



based on a totally symmetric, rank three tensor p a b c := p a PbPc with binomial constants of propor- 
tionality derived, as above, and using the constraint {p\ + P2) 3 = 1 to form the vector P with four 
components Pi = p\, P2 = "ipiP^,, P3 = 3p\p2 and P4 = p\- In the 2 K case (corresponding 
to a rank k = K — 1 tensor array p ai ... afc ), the rate generator matrices L a K \ Li have lower and 
upper diagonal entries K—l, ■ ■ ■ , 2, 1 and 1, 2, • • • , A"— 1, respectively, with the diagonals ensuring 
that the zero column sum condition is satisfied. Again these matrices satisfy the same commutator 
bracket relations (Lie algebra) as their 2x2 progenitors and hence generate phylogenetic models 
that are guaranteed to satisfy the closure property. 



3 Markov invariants 

In this section we adopt the background context of group actions on which we base our general 
theorems on Markov invariants. Details are provided in E fAl where we restate our previous technical 
results. We then explore the counting of Markov invariants for symmetrically embedded models 
in diverse dimensions (number of characters of the generating model, embedding rank and hence 
number of characters of the derived model, number of taxa, and polynomial degree of the invariant) 
and tabulate several low-dimensional cases. Finally we give details of the quadratic degree invariants 
for the 2 3 case, and explicit constructions of them along with the cubic, determinant function 
for comparison. 

As explained above and in systematic terms in fjAj embedded submodels are associated 
with particular matrix group constructions, whereby the character probability distribution p for a 
starting model on K' characters, is regarded as a progenitor for a target model deriving from a 
composite tensor array. If the starting model has dimension K' and the tensor p is of rank k and 
totally symmetric (the only case we consider), then p has K = K' (K' +1){K' +2) . . . (K' +k — l)/kl 
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K' 


2 


2 


2 


3 


3 


4 


4 


k 


2 


3 


4 


2 


3 


2 


3 


K 


3 


4 


5 


6 


10 


10 


20 



Table 1: Identification of embedded submodels for low-dimensional cases of interest. (K',k,K) 
gives the number of characters of the progenitor model, the rank of the embedding tensor, and the 
number of characters of the target model, respectively, with K = K'(K' + !)(■ ■ ■ )(K'+k— l)/kl. 



components, and so K is number of characters of the the target model. Table [U gives a list of 
several cases of interest for low-dimensional examples; of course K' = 2, k = 2, 3, • • • K— 1 are the 
2 >— > 3, 2 <—t 4, and 2 ^ K cases already identified. 

Markov invariants (see ISumner et al\ (|2008l ) ; ISumner Jarvisl (j2009l ). and §Aj) are formally 



polynomials in the components of a phylogenetic tensor P with components P ai a 2 ...a L representing 
the probability of observing the character state pattern (ai,a2,... , a£) at the leaves of the tree. 
Hence for L taxa P is a tensor with k L components, indexed by L sets of k multi-indices. Markov 
invariants are constructed such that, under time evolution associated with the model on the pendant 
edges, they change at most by a multiplicative factor. For clarity, at the pendant edges of the tree 
let rrii, i = 1, • • • , L, be the K' x K' transition matrices of the starting model, and denote the 
embedding into the target modeH as Mj = M(mj), i = 1, ■ ■ ■ , L. Then given the transformation 
rule, (|A-1|) . for P for pendant edge evolution under the model, a Markov invariant / must satisfy 
(|A-4p . namely 

I(P') = det( mi ) wi det(m 2 ) W2 ■ ■ ■ det(m L ) WL I(P), 

for some integers Wi. We note that for a continuous-time Markov model with rate matrices Qi, we 
have rrii = e® iT and det(mj) = e tr ^^ T , as in the introduction. 

Recall that a partition fi of an non-negative integer m is a set of non-negative integers 
Ai, A2, • • • , A r such that X% + X2 + ■ ■ ■ + X r = m. It is usual to write fj, = (Ai, A2, • • • , A r ) with Ai > 
A2 > • • • > A r and to use exponents for repeated parts. For example we write \i = (4, 3, 3, 3, 2, 2, 1) = 
(4, 3 3 ,2 2 ,1), with /i being a partition of 18. Markov invariants / of degree D are then identified 
by associating them with certain partitions of special shape, which in turn label certain irreducible 
group representation characters of the general linear group. 

Whether admissible fi arise at each D and number of leaves L, and how many occurrences 
thereof, must be answered for each case by evaluating a certain representation-theoretic branching 
rule (see ^Al Theorems 1 & 2, for details). Instances of such invariants are enumerated in Table 
[2] for the cases identified in Tabled) They are listed by K' , k, K (to define the embedding type), 
by L for small numbers of leaves, and then by degree D up to 4. From the tables it is evident 
that there exists a plethora of Markov invariants for embedded submodels. Further information on 
the independent invariants for phylogenetic tensors constructed under the Mar kov model can be 



access ed by studying the isotropy subgroup of leaf permutations on a tree, as in ISumner h Jarvis 



(|2009h . We defer discussion on the general results, including commentary on cases of possible 



biological interest, to the conclusions. 

For now we resume consideration of the lowest dimensional situation which motivated the 
present study, 2 3, and the lowest-degree (quadratic) invariants for the simplest situation of two 
leaves (L = 2), namely K' = 2, k = 2, K = 3, L = 2, D = 2. Here we outline briefly the manner in 
which these objects are constructed by standard tensor symmetrisation techniques. The end result 
will be the explicit forms ([5]) and (|6|) below. 

Recall that we handle the 3 state embedded model via a rank two phylogenetic probability 
2 In the previous discussion the corresponding rate matrices were distinguished by a superscript, t - K \ 
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D 


L 




(2,2,3) 


1 


2 


1 






3 


1 






4 


1 




2 


2 


5 






3 


14 






4 


41 




3 


2 


9 






3 


58 






4 


401 




A 

4 


2 


2d 






3 


321 






4 


5597 


(2,3,4) 


1 


1 


1 






3 


1 






4 


1 




2 


2 


8 






3 


32 






4 


128 




3 


2 


26 






3 


292 






4 


3 464 




4 


2 


100 






3 


3 688 






4 


158 384 





D 


L 




(3,2,6) 


1 


2 


1 






3 


1 






4 


1 




2 


2 









3 









4 







3 


2 


2 






3 


4 






4 


8 




A 

4 


2 


4 






3 


31 






4 


274 


(3,3,10) 


1 


2 


1 






3 


1 






4 


1 




2 


2 









3 









4 







3 


2 


5 






3 


13 






4 


41 




4 


2 


19 






3 


338 






4 


6 532 





D 


L 




(4,2,10) 


1 


2 


1 






3 


1 






4 


1 




2 


2 









3 









4 







3 


2 









3 









4 







A 

4 


2 


2 






3 


4 






4 


8 


(4,3,20) 


1 


2 


1 






3 


1 






4 


1 




2 


2 









3 









4 







3 


2 









3 









4 







4 


2 


2 






3 


4 






4 


8 



Table 2: Enumeration of linearly independent candidate Markov invariants for the embedded models 
listed in Table Q] above, for small numbers of taxa L, and degrees D up to 4. (K', k, K) gives the 
number of characters of the progenitor model, the rank of the embedding tensor, and the number of 
characters of the target model, respectively. The linear invariants simply record overall probability 
conservation for each phylogenetic tensor. The invariants I3 1 and ^2,2 studied in this paper are the 
two nonzero algebraically independent quadratic invariants from the count of 5 identified for the 
(2,2,3) model for D = L = 2. 
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array p ai a 2 = Pa 2 ai- Given the probability sum pu + p 12 + P21 + P22 = 1> the correct transcription 
between the two-state and three-state basis is a relabelling pu — > Pi, p\2 — > \P2, P21 —> \Pi, 
P22 ~^ P3 between p aia2 and a three component vector Pj. By the same token, for a model on 2 
leaves, the phylogenetic tensor will be an object p a b : a/3 built from edge transition matrices and a 
root probability in the usual way. Quantities at quadratic degree p a b,ai3Pcd,yS therefore admit only 
certain compatible tensor symmetrisations between the index labels a, b, c, d and a, f3, 7, 5. Table [2] 
lists 5 invariants corresponding to symmetry types identified in the discussion in fjAj Here we take 
up the invariants ^1, and 12,2, respectively. 

Consider for example the A = (3, 1) form quadratic in the components^, written down ac- 
cording to standard row and column Young symmetrisation and antisymmetrisation operations (on 
the sets a,b,c,d and a, /3, 7, 8 separately jj: 

yVabc.a/3j = p(ab,a/3)p(cd,j5) + p(ac,af3)p(bd,j5) - p(bd,a(3)p(ac,j5) - p(cd,a/3)p(ab,^5) 
d '5 

+ p(ab, jf3)p(cd, a5) + p{ac, ^I3)p(bd, a5) — p(bd, (3~/)p(ac, a5) — p{cd, j3^)p{ab, a5) 

The next step is to identify the part of this array which provides the Markov invariant. This is 
most natural in a transformed basis for the character states in which the pro bability mass is treat ed 
as a separate (constant) component '0' or '*' (for details see Appendix A of Sumner et al. (j2008l )). 



In the present case, after implementing this basis transformation, the unique invariant component 
(identified as i^i) becomes (up to an overall factor): 

2 2 

^3,1 := >%)0.000 = Yl Yl Wqbc.aP-y ■ 

2 '2 a,fe,c=la,/3,7=l ^ ^ 

Finally, reverting to the natural 3 state basis k = 1, 2, 3 interpreting p a b,a/3 as a 3 x 3 array Pij 
via the rules 11 — > 1, 12 — >■ |2 <— 21, 22 — > 3 already discussed, we have the concrete realization 

h,l = 4(^33 + 5OP23 + ^32) + 1^22) " 

4(iP 12 + \P 22 + ^32 + Pl3 + P23 + P33) ■ (5^21 + 1^22 + §^23 + ^31 + ^32 + P33) • (5) 

An analogous procedure yields the invariant 

-72,2 = -P33 + 2(^33 + 2K-P23 + -P32) + \P22) 2 + 

(Pl3 + -P23 + -P33) • (-P3I + -P32 + -P33) 

— 2(^Pl2 + \P22 + ^32 + P\3 + -P23 + -P33) ' (^23 + -P33) 

- 2(±P 21 + \P 2 2 + \P 2 3 + ^31 + ^32 + P33) • (1^32 + P 33 ) • (6) 

The crucial property of the 731 and ^2,2 quantities as invariants of weight (from £|A"j) w = 1 and 
w = 2, respectively, is how they transform under phylogenetic evolution. Namely, as t — > t + r, we 
have 

-73,1 -» 4,1 = det(rai) det(m 2 )i3,l, ^2,2 -» 4,2 = det(mi) 2 det(m 2 ) 2 /2,2, (7) 
where mi = e® iT . 

Table [2] also lists several invariants at degree 3 (see SjAJ. One of these, ^3 in our notation, 
is nothing but the Markov invariant coming from the general Markov model on 3 states, and well 



3 Writ ing p a b, a g as p (ab, a/3) for clarity. 
4 See ijSumneil [2006) for details 
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known via its log (as used for distance and likelihood studies) as the log Det measur^fl, whose form 
(as a cubic polynomial) is obvious and will provide a standard by which the behaviour of and 
^2,2 under simulation can be compared (see the concluding results and discussion below): 

Det = 13,3 = -P1I-P22-P33 + -P12-P23-F3I + -P13-P32-P2I — -P1I-P23-P32 — -P13-P22-P3I — -pL2-f > 21^33- 

Of course the Det is itself a Markov invariant of weight w = 3 and satisfies the transformation rule 

Det -> Det' = det(mi) 3 det(m 2 ) 3 Det. 



4 Results 



Is this section we present a simple simulation study that compares the performance of the Markov 
invariants Det, J22 and J31 as pairwise distance estimators for data generated under our symmetric 
embedded model. We do this by taking the theoretical probability distribution for a two leaf tree 
generated under the model and then sampling from the multinomial distribution for a range of 
sequence lengths. We discuss the derivation of unbiased estimators of the invariants and observe 
that the invariants I22 and J 3 i have superior statistical estimation power over that of the (log) Det. 
This adds credibility to the intuitive notion that it should be invariants of lower degree and lower 
weight that can be expected to perform best in practical contexts. 

If we take a root probability distribution 7Tj, i = 1,2,3, the "starting" (zero-edge length) 
probability distribution on a two leaf tree is P~ = 7Tj, P°- = if i 7^ j. For non-zero edge lengths, 
J3 1 and ^2,2 are then determined by this P° and the transformation rules ([7]). Starting with the 
rate matrix Q as in ([1]), we have the standard form 

™w = ^=(^ ^a). 

where A = ^qrg(l — e - *- "^ 4 ), with determinant det to = exp(— (a + /3)t). Thus choosing as inde- 
pendent parameters a and t, with a + /3 = 1, if the edge distances are t\ and t 2 , the theoretical 
values for I3 1 and I 2) 2 become after evaluating them on P°, 

h,i = e-* 1 e-* 2 (47r 1 7r 3 + iriir 2 + 7r 2 vr 3 ), I 2>2 = e" 2tl e^ 2t ' 2 |(vr| + 8tti tt 3 ). 

These are to be compared to the "Det" function, with theoretical value 

Det = e -3 * 1 ^ 3 * 2 (71-17^3). 



We take as data an alignment of two sequences consisting of three character states corre- 
sponding to i = 1, 2, 3 reduced to the pattern frequencies 

Fij := {number of occurrences of the pattern (ij)}. 

It is possible to use this data and the above invariant functions to obtain an estimator of the 
pairwise distance A = t\ + 1 2 under the symmetric embedded model 2 3, as follows. 



5 See Sumner et al. ( 20081) for comments on the rela tionship between log Det and other Markov invariants for the 
general model, and also I Sumner fc Jarvisl (2005, 200Cj|). 
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Consider the estimators of the invariants constructed by simply making the "obvious" re- 
placement Pij ->• fij := jjFij given bj@ 

h,i :=4(/ 33 + |(/23 + /aa) + J/22)- 

4(^/12 + 2/22 + ^32 + /l3 + /23 + /33) ' (5/2I + 5/22 + f/23 + /31 + /32 + /33) 3 
-^2,2 := /33 + 2(/ 3 3 + |(/23 + /32) + j/22) 2 + (/l3 + /23 + /33) ■ (/31 + /32 + / 3 3) 

— 2(|/l2 + 5/22 + 5/32 + /l3 + /23 + /33) ' (|/23 + /33) 

— 2(5/21 + 5/22 + 5/23 + /31 + /32 + /33) - (f/32 + /33) 1 

Det :=/ll/22/33 + /12/23/3I + /13/32/2I — /ll/2 3 /32 — /13/22/3I — /l2/2l/33- 

We work under the assumption that frequency array [Fij] 1<i j <3 is generated by sampling N pat- 
terns with probability generated under our model on a two taxa tree. This means that the 
probability of observing a given frequency array [i ? ij] 1< j J<3 is given by the multinomial form 



p-Pn 0^12 0^33 

; JT 1 1 z 1 o ... J 33 • 



Fn!-Fi 2 ! • • • F33! 11 12 
Under these conditions it is easy to show using generating function techniques that we have 



E 



N(N - l)4 2 } + N(I^( - (P 22 + 2P 32 + 2P 23 + 4P 33 )), 



where ij^l and ij^l are the quadratic and linear parts of I 3j i respectively (see Sumner et al. (|2008h 



^3,1 



7^ J3 1, so that I^i is a biased 



for example calculations of this kind). Thus we observe that E 
estimator of J31. 

This is easily rectified by defining the unbiased estimator: 

(if} + {N- + F 22 + 2F 22 + 2F 23 + 2F 32 + 4F 33 ) . 

This estimator then takes on the expectation value 



7 3,i ' — N{N 



E 



■■Mb) 
J 3,l 



^3,1 



-(*l+*2) 



(47T17T3 + 7Ti7T 2 + 7r 2 7T3). 



A similar observation for I 2)2 leads to its unbiased estimators 

hi) + (N ~ 1)1$ + (^33 " |^22 + \-F 23 + iF 32 ; 

rDet, 



r(ub) ._ 
Jo 



Det 



2,2 
(«6) 



JV(JV-l) 

1 

N(N-l)(N-2)' 



with expectation values 



C(u6) 
-'2,2 



/ 2i2 = e -2(* 1+ * 2 )l (7r 2 + ^ i7r3)) 



Det 



-3(ti+t 2 ) 



(7Ti7r 2 7T3). 



We now define the marginalizations := Ylj Fij an d := Yli^ij an d> as is standard 
for the "log Det", estimate the values vri,7r 2 ,7T3 by assuming the process was stationary and taking 
the harmonic means: 



7T; 



F (l) p(2) 

i i 



3 More formally, this equates to taking the fij as our best estimate of the probabilities Pij for this data. 
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With these estimators in hand, we are now in a position to define six reasonable estimators 
of A = t\ + i 2 that can readily be evaluated directly from the pattern counts Fij: 



A Det 




^3,1 J + log (47T17T3 + 7fl7f 2 + 7r 2 7r 3 ) , 
?2,l) + lo g (^l 2 + 8vfl7f 3 ) , 

Det^j + log (vfi) + log (vf 2 ) + log (vf 3 ) , 

?3,l) + lo g ( 4 ^3 + 7fl7T2 + K 2 n 3 ) , 
?2,l) + lo g (^l 2 + 8vfl7f 3 ) , 

Z?et ( " b) ) + log (vfi) + log (vf 2 ) + log (vf 3 ) . 



(8) 



To compare the performance of these estimators we performed a simulation study over a range 
of sequence lengths, from very long N = 10 6 to very short N = 10, with fixed rate parameters 
a = 0.45, j3 = 0.55 and t\ + i 2 = 1- The results are presented in Figure [TJ Careful inspection of the 
results shows that it is consistently the lower degree invariants that have greater statistical power 
and that taking unbiased forms also provides a significant improvement. 

5 Conclusion 

In this paper we have described a novel approach to phylogenetic model construction using sym- 
metric tensor embeddings that ensures multiplicative closure. Although this model construction is 
of interest in its own right, we went further to exploit the simple structure of the embedding to 
give examples of Markov invariants for these models. We also showed how these invariants can be 
exploited effectively as distance estimators with favourable statistical properties (as compared to 
the standard "log Det"). 

We should emphasize that we do not make any claim that the symmetrically embedded models 
discussed have any particular direct appeal as biologically realistic rate matrices for molecular 
phylogenetics - this is indeed why we did not shy away from considering the 3-state case in detail. 
We do however argue strongly that the multiplicative closure that is present in our model is itself 
highly desirable from a biological perspective, and make the point that the general time reversible 
model (perhaps the most popular currently in use) does not satisfy the closure property. 

The work presented in this paper thus serves as an elementary example that illustrates how 
one may go about constructing models with multiplicative closure. We are currently directing work 
into expanding our knowledge of such "closed" phylogenetic models and we expect that furthering 
the connection to Lie algebras will be vital in this regard. 

A Appendix: Groups and representation theory in phylogenetic 
models 

A.l General results 

The Markov model for phylogenetic branching sketched in Sj2] above can be given a formal setting 
for considering phylogenetic invariants and related constructions, and in particular, Markov invari- 
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Figure 1: Comparison of performance of Markov invariants. The barplots give the mean 
(dark shade) and variance (light shade) of 10 7 runs with sequence length ranging from N = 100 to 
N = 100,000. 






ants ( Sumner et al . 20081 ). In this appendix we recall our main results from that paper, and the 
representation-theoretic results which, appropriately generalised here, lead to the enumeration and 
construction of Markov invariants for the embedded submodels which are the subject of the present 
work. 

The starting point of our approach is to regard the model of stochastic change via Markov 
matrices, ( 2]), in terms of l inear actions of certain matrix groups G affiliated with stochastic matrices 



(Johnson 



19851 : iMouradl - liool l. Most generally, consider a phylogenetic tensor P of rank L (the 



array of pattern frequencies for L leaf edges of a presumed phylogenetic tree under the general 
Markov model). The equation equivalent to (|2|) describing the evolution along the leaf edgea^, 



P' = Mi <g> M 2 



M L -P, 



(A-l) 



is simply the action of an element (Mi, M2, • • • , Ml) of the L-fold direct product group GxGx 



7 Here we do not need to consider the full phylogenetic tree model, which involves labelling all edges by stochastic 
matrices and appropri ate summations ove r characters. For remar ks on such internal structure in the context of 
Markov invariants see (|Sumner et all 120081 ; I Sumner fc Jarvisl . |2009| ) . 
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G. Where there is a rate matrix in continuous time models, or generically if the Mj are nonsingular, 
then G can simply be taken to be the general linear group GL(K). Specifically however, we restrict 
consideration to probability-conserving, stochastic ma trices fulfilling a unit columrJl sum condition, 



so the relevant matrix group is GL\(K), which from (jjohnsonl . Il985l : iMouradl . 120041 ) is isomorphic 
to the affine grourJl = GL(K - 1) x C K ~ l . 

Including the symmetric embeddings construction, the full chain of edge subgroups becomes 

GL{K) > GL X (K) > GLx(2) > GL(1). (A-2) 

The last step simply asserts that with a fixed rate matrix Q, a continuous-time Markov chain 
generically provide a K x K representation of the time evolution group R + ; the specialisation 
GLi(K) > GL\{2) describes our present '2 > K' situation, and would be replaced for example, 
by GLi(K) > GLx(K') in the l K' ^ K 1 case, K' < K. Formally then, the assertion (fA~^T]) is that 
of the full group branching rule for the reduction 

GL(K L ) > GL(K) x GL(K) x • • • x GL(K) (A-3) 

followed by the reduction (|A-2|) on each edge. From the formal point of view, the problem of dealing 
with polynomial functions of P, say of degree D, is thus to implement the corresponding branching 
rule for the group representations arising. Specifically the Markov invariants I(P), now for the 
embedded GL\{2) submodel, or GL\{K') in general, are in correspondence with the 1-dimensional 
representations occurring. Specifically, representing Mj = M(nii), with rrtj G GLi(K'), i = 1, • • • , L 
as the embedded K' <—)■ K submodel, then under (|A-1|) we have by definition 

I(P') = det(mi)"' 1 det{m 2 ) W2 ■ ■ ■ det{m L ) WL I(P) (A-4) 

for a Markov invariant of weight (w\, W2, ••• , vjl)- 

The two stages of the above representation-theoretic problem have been solved in lSumner et al. 



(2008), and for completeness we quote the relevant theorems. Firstly recall that the polynomial 



ring C[P] is isomorphic to the symmetric tensor algebra V(P), that is, at each degree, symmetric 
tensor powers of the module corresponding to P. 

Theorem 1: Polynomial covariants for embedded models. 

Consider the embedding GL(K L ) D x L GL(K') defined by the branching rule for the fundamental 
if^-dimensional representation 

{1}^{Ai}®{A 2 }®---<8>{A l }. 
The corresponding branching rule for the D'th symmetric tensor power is given by 

fTl*(T2*-"-*<7X9(-D) 

{D}^ Yl ({Ai}®W)®({A 2 }®{o-2})® ■■■»({ A^}®{o-i})- (A-5) 
Here standard partition notation A or (•) has been adopted for irreducibl e tensor represent ations, 



with {•} denoting the corresponding characters (symmetric functions, as in (jLittlewoodl . 1 19551 ) ) . The 
operation * of inner multiplication corresponds to the evaluation of tensor products of irreducible 
representations in the symmetric group &d for partitions <7j b D. The symbol (g> stands for the 
operation of plethys m on group characters. For a recent discussion of the calculus of plethysms see 
(jFauser et al\ . l200fil ). 



Or unit row sum condition; however we use left multiplication for our actions. 
9 We discuss groups and representions over C and regard real parametrisations as a separate issue. 
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The above result has been stated in full generality allowing for the possibility of heterogeneous 
edges (models with different numbers of characters). In practice we restrict attention to stan- 
dard embeddings K' K where K is the dimension of a certain irreducible representation A of 
GL(K'). Moreover, for the cases of interest, A is a symmetric tensor of rank k; specifically for 
2 K, K = k + 1. With the above branching rule in hand, the occurrence of one dimensional 
representations (including multiplicities) can be read off using the following result: 



Theorem 2: Polynomial invariants for phylogenetic models and embedded models. 

Linearly independent polynomial invariants at degree D of the groups (i) x L GL(K'), (ii) x L GL\(K') 
and (iii) x L GL\^(K') for the general phylogenetic model are given by the one dimensional mod- 
ules of these groups within polynomial representations of x L GL(K') corresponding to the following 
partitions: 

x L GL(K') : {r K '}®{r K '}®---®{r K '}, where K'r = kD, 

x L GL l {K') : {r l +s u rf- 1 }®{r 2 + S2,rf- 1 }®---®{r L + s L ,rf- 1 }, (A-6) 
where K rj + Sj = kD, 
x L GL 1A (K') : {n+surf- 2 ,^} ® {r 2 + s 2 ,rf- 2 ,t 2 } ® ■ ■ ■ ® {r L + s L ,rK'- 2 ,t L }, 
where (K — l)rj + Sj + ti = kD, 

The number of admissible partitions of the given forms {/^i}<8>{/X2}<S>- • in each case (i), (ii), 

(iii), deriving from (|A-5j) . is the number of times the inner product c\ * a 2 * ■ ■ ■ * o~l of irreducible 
representations of the symmetric group & d contains the one-dimensional irreducible representation 
(D). This is also the number of linearly independent polynomial invariants in each case. □ 



Clear ly the case k = 1, K' = K corresponds to the standard situation treated in (ISumner et al 



2008), whereas k ^ 1,K' < K covers embedded submodels. For the K ' = 2 model (general phy- 



logenetic model on 2 characters) and polynomial degree D, the Markov invariants (invariants of 
GLi(2), and a fortiori of GL{2)) occur as characters of the type {r, r} with 2r = kD (one dimen- 
sional tensor representations of GL{2)), denoted I rr , of weight r or more generally are associated 
with characters of the type {r + s, r} with 2r + s = kD, denoted I r + S ,r a ls° of weight r. 



A. 2 Enumeration of invariants for low-dimensional cases 

Finally we turn to the enumeration of invariants for concrete, low-dimensional, cases. For the 2^-3, 
k = 2, K' = 2, K = 3 case, admissible characters (partition shapes) {/ii} according Theorem 2 are 
obviously {4}, {2,2} and {3,1} at degree D = 2, and {6}, {5,1} and {4,2} and {3,3} at degree 
D = 3. Their precise occurrence is determined by the evaluations of the appropriate plethysms. 
The following are well known and can be checked on dimensional grounds: 

D = 2 : 

{2}®{2} = {4} + {2, 2}, {2}®{1 2 } = {3, 1}; 
D = 3 : 

{2}®{3} = {6} + {4, 2} + {2 3 }, {2}^{2, 1} = {5, 1} + {4, 2} + {3 2 }, {2}®{1 3 } = {41 2 } + {3, 3}. 

In turn, for L = 2 taxa, the possible invariants are associated with pairs {/*i} <8> {^2} for such 
admissible characters, weighted by a combinatorial factor (2 if {/ii} 7^ {^2}), and weighted by the 
(nonzero) multiplicity of the trivial representation (D) of the symmetric group &d in the reduction 
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of the inner product of the parent {<7i} * {cr 2 }. Given the rules 



D = 2 (in 6 2 ) : 

{2}* {2} ={2}, {2}*{1 2 } = {1 2 }, {1 2 }*{1 2 } = {2} 
D = 3 (in 63) : 

{3}* {3} ={3}, {3}* {2,1} = {2,1}, {2,1}*{2,1} = {3} + {1 3 } + {2,1}, 
{2,1}*{1 3 }={2,1}, {I 3 }* {I 3 } = {3}, 

it is clear that there are 5 = 2 2 + l 2 linearly independent invariants at quadratic degree (namely 
{4} ® {4}, {3, 1} ® {4}, {4} ® {3, 1}, {3, 1} ® {3, 1}, and {2, 2} ® {2, 2}), and 9 = 2 2 + 2 2 + l 2 
linearly independent invariants at cubic degree. At the quadratic level, {4} ® {4} just represents 
probability conservation for P, while {3, 1} ® {4} and {4} ® {3, 1} turn out to vanish identically for 
this casqlj. These results verify that at quadratic degree, both {2, 2} ® {2, 2} and {3, 1} ® {3, 1} 
occur in the appropriate plethysms for admissible inner products, and lead to the invariants 7~2,2 
and I3 1 constructed explicitly in $3l Their properties are explored numerically in the final §H 
via some simple simulation studies. At cubic level it turns out that there are various invariants 
involving {4, 2}, {5, 1} and two invariants {3, 3} ® {3, 3} (again, cases such as {6} ® {6} are trivial 
by probability conservation^ 1 "!) . One of the latter, associated with the (k = 1, K' = K = 3) GL(3) 
antisymmetric invariant, via {l 3 } * {l 3 } = {3}, is in fact identical to the determinant function. 

The above results for 2 3 prove to be the simplest of a plethora of cases, some of which may 
be of interest for phylogenetic applications (see the concluding remarks, §3]), but whose existence 
serves to illustrate our general philosophy. In $3]we record invariants for symmetric embeddings for 
diverse (low-dimensional) cases of initial and target models K', k and K (see table Q] for a tabulation 
of models and table[2]for an enumeration of invariants for them 12 !). All manipulations with products, 
plethysms and group branching rule s can be evaluate d symbolically using an appropriate group 
theory package. The program Schur, ( Wybourne . 20041 ) . works with combinatorial algorithms based 



on manipulations of the group characters encoded as the celebrated Schur functions (symmetric 
polynomials in n indeterminates representing the eigenvalues of a n x n matrix). 

The general algorithm for identification of higher invariants follows the above pattern. Con- 
sider for example the following plethvsni 1 " 3 ! at degree 5 for a rank 2 embedding: 

{2} ® {3, 2} = {82} + {73} + {721} + {64} + {631} + 2{62 2 } + {541} + {532}+ 
+ {531 2 } + {52 2 1} + {4 2 2} + {4321} + {42 3 } + {3 2 21 2 }. 

Thus admissible {//} for the 2 state model 2 > 3 are {82}, {73}, and {64}; for the 3 state mode _ 
3 6, we have {6, 2 2 } with multiplicity 2, and for the 4 state 4 ^ 10 model, the candidatqlf 
{42 3 }. Enumeration of invariants entails counting those products of admissible {/u} whose parent 
inner product J\i *{<7j}, weighted by the correct multinomial factor, of symmetric group characters 
in &d contains the trivial one-dimensional representation (D) (weighted by multiplicity > 1). 



10 They are examples of mixed weight invariant s which in principle give different information for each edge (other 
examples of such mixed invariants were noted in l)Sumner 

n In relation to the remarks about algebraic independence below, note that the product 74/2,2 of the linear invariant 
1 4 and the quadratic 72,2 is also of cubic degree and has weight w — 2. 

At least up to linear independence: formally they form a ring, b ut the question of algebraic independence is 
beyond the scope of the present investigation fsee lSumner et al\ (|2008l )V 

13 Using Schur. 

14 According to Theorem 2, for choices of parameters giving a symmetric model on 3 states, there would be additional 
candidates {721}, {631}, {541}, {532}, and {4 2 2}. 

15 Likewise for a starting symmetric 4 state model, there is the additional candidate {52 2 1}. 
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